Example: text clustering

Dataset


In [151]:
from sklearn.datasets import fetch_20newsgroups

train_all = fetch_20newsgroups(subset='train')
print(train_all.target_names)


['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

In [152]:
simple_dataset = fetch_20newsgroups(
    subset='train', 
    categories=['comp.sys.mac.hardware', 'soc.religion.christian', 'rec.sport.hockey'])

In [153]:
print(simple_dataset.data[0])


From: erik@cheshire.oxy.edu (Erik Adams)
Subject: HELP!!  My Macintosh "luggable" has lines on its screen!
Organization: Occidental College, Los Angeles, CA 90041 USA.
Distribution: comp
Lines: 20

Okay, I don't use it very much, but I would like for it to keep working
correctly, at least as long as Apple continues to make System software
that will run on it, if slowly :-)

Here is the problem:  When the screen is tilted too far back, vertical
lines appear on the screen.  They are every 10 pixels or so, and seem
to be affected somewhat by opening windows and pulling down menus.
It looks to a semi-technical person like there is a loose connection
between the screen and the rest of the computer.

I am open to suggestions that do not involve buying a new computer,
or taking this one to the shop.  I would also like to not have
to buy one of Larry Pina's books.  I like Larry, but I'm not sure
I feel strongly enough about the computer to buy a service manual
for it.

On a related note:  what does the monitor connector connect to?

Erik



In [154]:
simple_dataset.target


Out[154]:
array([0, 0, 1, ..., 0, 1, 2])

In [155]:
print(simple_dataset.data[-1])


From: dlecoint@garnet.acns.fsu.edu (Darius_Lecointe)
Subject: Re: Sabbath Admissions 5of5
Organization: Florida State University
Lines: 21

I find it interesting that cls never answered any of the questions posed. 
Then he goes on the make statements which make me shudder.  He has
established a two-tiered God.  One set of rules for the Jews (his people)
and another set for the saved Gentiles (his people).  Why would God
discriminate?  Does the Jew who accepts Jesus now have to live under the
Gentile rules.

God has one set of rules for all his people.  Paul was never against the
law.  In fact he says repeatedly that faith establishes rather that annuls
the law.  Paul's point is germane to both Jews and Greeks.  The Law can
never be used as an instrument of salvation.  And please do not combine
the ceremonial and moral laws in one.

In Matt 5:14-19 Christ plainly says what He came to do and you say He was
only saying that for the Jews's benefit.  Your Christ must be a
politician, speaking from both sides of His mouth.  As Paul said, "I have
not so learned Christ."  Forget all the theology, just do what Jesus says.
 Your excuses will not hold up in a court of law on earth, far less in
God's judgement hall.

Darius


In [156]:
print(simple_dataset.data[-2])


From: scialdone@nssdca.gsfc.nasa.gov (John Scialdone)
Subject: CUT Vukota and Pilon!!!
News-Software: VAX/VMS VNEWS 1.41    
Organization: NASA - Goddard Space Flight Center
Lines: 32

I have been to all 3 Isles/Caps tilts at the Crap Centre this year, all Isles
wins and there is no justification for Vukota and Pilon to play for the Isles.
Vukota is absolutely the worst puck handler in the world!! He couldn't hit a
bull in the ass with a banjo!! Al must remember a few years back when Mick 
scored 3 goals in one period against the Caps in a 5-3 Isles win. I was there
and was astonished as was the rest of the crowd. Wake-up Al!!! Years later he's
gotten worse. He's a cheap shot artist and always ends up getting
stupid/senseless penalties. I think he would make a good police officier!!!

As for Pilon, he can't carry the puck out to center ice by himself. He either
makes a bad pass resulting in a turnover, or he attempts to bring the puck 
towards the neutral zone and skates right into an opposing skater. He can't
stay on his skates with most forwards or centers. He either falls down or 
committs a penalty. Call up somebody from Capital District AL!!!!!

As far as the playoffs, the Isles are as difficult to figure out as the Caps.
Two good teams with talent but so inconsistent. They should meet in the first
round. The Isles seem to play up to the level of their competition so they
should play well against Jersey tonite. It'll probably be another tight 1-goal
game as the last 20 games hve been for the Isles. I wish when the get a lead
they could continue to pour it on instead of settling back into a defensive
shell and letting the opposition get back in the game. Al MUST understand he
can't do with this team what he did with the 80-83 Isles. maybe Al should got
to. Where is Bobby Nystrom?? Clark Gilles?? John Tonelli?? These are the kind
of young minds we need behing the bench!!    FIRE AL!!!!

John Scialdone
SCIALDONE@NSSDCA.GSFC.NASA.GOV

**********When your ship comes in, first man takes the Sail********************




In [157]:
print(len(simple_dataset.data))


1777
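
The integer labels in `simple_dataset.target` index into `simple_dataset.target_names`. The sketch below assumes that scikit-learn orders the selected categories alphabetically rather than in the order they were passed. This is consistent with the three posts printed above, whose labels are 0 (Mac hardware), 1 (hockey) and 2 (Christianity). The `label_to_name` helper is hypothetical and only illustrates the lookup.

```python
# Categories in the order they were passed to fetch_20newsgroups above.
requested = ['comp.sys.mac.hardware', 'soc.religion.christian', 'rec.sport.hockey']

# Assumption: target_names holds the selected categories in sorted order.
target_names = sorted(requested)

def label_to_name(label, names=target_names):
    """Translate an integer target label into its newsgroup name."""
    return names[label]

# Labels of the first, second-to-last and last posts, taken from Out[154].
for label in (0, 1, 2):
    print(label, label_to_name(label))
```

In the notebook itself the same lookup would be `simple_dataset.target_names[label]`.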

Features


In [158]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Drop terms that occur in more than 500 documents or in fewer than 10.
vectorizer = TfidfVectorizer(max_df=500, min_df=10)
matrix = vectorizer.fit_transform(simple_dataset.data)

In [159]:
print(matrix.shape)


(1777, 3767)
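
With integer arguments, `max_df` and `min_df` are absolute document counts, which is why only 3767 terms survive. The following is a minimal pure-Python sketch of that document-frequency filter on a hypothetical toy corpus. `filter_vocabulary` is an illustrative helper, not part of scikit-learn, and it skips the tokenization and TF-IDF weighting that `TfidfVectorizer` also performs.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df, max_df):
    """Keep terms whose document frequency lies in [min_df, max_df]."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_df)

# Hypothetical toy corpus, already tokenized and lower-cased.
docs = [
    ['the', 'mac', 'screen'],
    ['the', 'hockey', 'game'],
    ['the', 'mac', 'monitor'],
    ['the', 'hockey', 'playoffs'],
]

# 'the' occurs in all 4 documents (too common); 'screen', 'game',
# 'monitor' and 'playoffs' occur in only 1 (too rare).
print(filter_vocabulary(docs, min_df=2, max_df=3))
```

Only the mid-frequency terms remain, and those are the ones most likely to separate topics.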

Agglomerative clustering (neighbour joining)


In [165]:
from sklearn.cluster import AgglomerativeClustering

# Note: recent scikit-learn releases call the `affinity` parameter `metric`.
model = AgglomerativeClustering(n_clusters=3, affinity='cosine', linkage='complete')
preds = model.fit_predict(matrix.toarray())

In [166]:
print(list(preds))


[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1]
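
The printed labels are almost all 0, which suggests that nearly every post ended up in a single cluster. Counting the labels is a quicker way to check this than reading the full list. The sketch below uses a hypothetical short label vector; `cluster_sizes` is an illustrative helper.

```python
from collections import Counter

def cluster_sizes(labels):
    """Return {cluster id: number of points}, largest cluster first."""
    return dict(Counter(labels).most_common())

# Hypothetical short label vector; in the notebook one would pass `preds`.
toy_preds = [0, 0, 0, 2, 0, 0, 1, 0, 0, 2]
print(cluster_sizes(toy_preds))
```

A heavily skewed size distribution like this one indicates that the clustering did not recover the three roughly balanced topics.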

In [167]:
print(matrix[0])


  (0, 877)	0.111806579628
  (0, 880)	0.11851815596
  (0, 2243)	0.0925317447412
  (0, 1136)	0.0500506718781
  (0, 2358)	0.0840797530073
  (0, 2820)	0.108042909343
  (0, 2108)	0.1235763517
  (0, 3035)	0.101544778998
  (0, 1229)	0.0736045063897
  (0, 3262)	0.130212979504
  (0, 1366)	0.0836339497399
  (0, 3305)	0.0678317940862
  (0, 551)	0.0978140030819
  (0, 1961)	0.228859994985
  (0, 619)	0.170473343334
  (0, 294)	0.0517710627222
  (0, 3322)	0.0953890204835
  (0, 2325)	0.054192064708
  (0, 620)	0.111806579628
  (0, 1831)	0.125052646182
  (0, 3285)	0.112645815931
  (0, 2422)	0.0942723610696
  (0, 301)	0.0556325669605
  (0, 855)	0.201625540883
  (0, 2868)	0.0859620897079
  :	:
  (0, 900)	0.1235763517
  (0, 354)	0.0694743738355
  (0, 2048)	0.0748844895603
  (0, 1991)	0.0711302768586
  (0, 926)	0.117420593185
  (0, 3722)	0.0950102887763
  (0, 1905)	0.0817353647899
  (0, 2273)	0.0579600218377
  (0, 3600)	0.0573701527992
  (0, 3555)	0.0622908798771
  (0, 2413)	0.102588780336
  (0, 69)	0.0685826705273
  (0, 838)	0.101544778998
  (0, 1121)	0.0644281622922
  (0, 3553)	0.0734668995949
  (0, 625)	0.0520717390913
  (0, 319)	0.100549353806
  (0, 2056)	0.0961672074361
  (0, 810)	0.0809427637022
  (0, 2991)	0.389561083053
  (0, 1846)	0.0674142631248
  (0, 2082)	0.0965671776392
  (0, 1659)	0.0722667093676
  (0, 239)	0.0950102887763
  (0, 1246)	0.390638938513

In [168]:
vectorizer.get_feature_names()


Out[168]:
[u'00',
 u'000',
 u'01',
 u'02',
 u'03',
 u'030',
 u'0358',
 u'04',
 u'040',
 u'05',
 u'06',
 u'07',
 u'08',
 u'09',
 u'10',
 u'100',
 u'1000',
 u'101',
 u'102',
 u'104',
 u'105',
 u'106',
 u'109',
 u'11',
 u'110',
 u'112',
 u'113',
 u'119',
 u'12',
 u'120',
 u'126',
 u'127',
 u'128',
 u'129',
 u'13',
 u'132',
 u'133',
 u'14',
 u'140',
 u'15',
 u'150',
 u'152',
 u'16',
 u'160',
 u'17',
 u'170',
 u'175',
 u'18',
 u'180',
 u'19',
 u'1987',
 u'1988',
 u'1989',
 u'199',
 u'1990',
 u'1991',
 u'1992',
 u'1993',
 u'1993apr14',
 u'1993apr15',
 u'1993apr16',
 u'1993apr18',
 u'1993apr19',
 u'1993apr20',
 u'1993apr5',
 u'1993apr6',
 u'1d17',
 u'1d20',
 u'1st',
 u'20',
 u'200',
 u'2000',
 u'203',
 u'21',
 u'210',
 u'22',
 u'23',
 u'230',
 u'24',
 u'240',
 u'241',
 u'25',
 u'253',
 u'256',
 u'25mhz',
 u'26',
 u'27',
 u'28',
 u'286',
 u'29',
 u'2nd',
 u'30',
 u'300',
 u'30602',
 u'31',
 u'32',
 u'33',
 u'34',
 u'35',
 u'36',
 u'37',
 u'38',
 u'386',
 u'39',
 u'3rd',
 u'40',
 u'400',
 u'403',
 u'408',
 u'41',
 u'42',
 u'43',
 u'44',
 u'45',
 u'46',
 u'47',
 u'48',
 u'486',
 u'49',
 u'4mb',
 u'4th',
 u'50',
 u'500',
 u'51',
 u'512',
 u'512k',
 u'52',
 u'53',
 u'54',
 u'542',
 u'55',
 u'56',
 u'57',
 u'58',
 u'59',
 u'5of5',
 u'5th',
 u'60',
 u'600',
 u'61',
 u'610',
 u'62',
 u'63',
 u'64',
 u'65',
 u'650',
 u'66',
 u'67',
 u'68',
 u'68030',
 u'68040',
 u'680x0',
 u'69',
 u'6th',
 u'70',
 u'700',
 u'706',
 u'71',
 u'72',
 u'73',
 u'74',
 u'7415',
 u'75',
 u'76',
 u'77',
 u'78',
 u'79',
 u'7th',
 u'80',
 u'800',
 u'80ns',
 u'81',
 u'82',
 u'83',
 u'84',
 u'85',
 u'86',
 u'87',
 u'88',
 u'89',
 u'90',
 u'900',
 u'91',
 u'92',
 u'93',
 u'94',
 u'95',
 u'950',
 u'96',
 u'97',
 u'9760',
 u'98',
 u'99',
 u'__',
 u'___',
 u'____',
 u'aa888',
 u'aaron',
 u'ab',
 u'abc',
 u'ability',
 u'able',
 u'abo',
 u'above',
 u'abraham',
 u'absence',
 u'absolute',
 u'absolutely',
 u'absolutes',
 u'ac',
 u'academic',
 u'acc',
 u'accelerator',
 u'accept',
 u'acceptable',
 u'accepted',
 u'accepting',
 u'access',
 u'according',
 u'account',
 u'accounts',
 u'accurate',
 u'achieve',
 u'achkar',
 u'acknowledge',
 u'acns',
 u'acquired',
 u'across',
 u'acs',
 u'acsu',
 u'act',
 u'action',
 u'actions',
 u'active',
 u'acts',
 u'actual',
 u'actually',
 u'ad',
 u'adam',
 u'adams',
 u'adapter',
 u'adb',
 u'add',
 u'added',
 u'adding',
 u'addition',
 u'additional',
 u'address',
 u'addresses',
 u'adirondack',
 u'admissions',
 u'admit',
 u'adopt',
 u'advance',
 u'advanced',
 u'advantage',
 u'advice',
 u'affected',
 u'afford',
 u'afraid',
 u'africa',
 u'after',
 u'afternoon',
 u'again',
 u'against',
 u'age',
 u'agnostic',
 u'ago',
 u'agree',
 u'agreed',
 u'agreement',
 u'ah',
 u'ahead',
 u'ahl',
 u'ai',
 u'aids',
 u'air',
 u'aisun3',
 u'al',
 u'alan',
 u'alberta',
 u'alchemy',
 u'alexander',
 u'alive',
 u'allen',
 u'allow',
 u'allowed',
 u'allows',
 u'almost',
 u'alone',
 u'along',
 u'alot',
 u'alpha',
 u'already',
 u'also',
 u'alt',
 u'alternative',
 u'although',
 u'altogether',
 u'alvin',
 u'always',
 u'am',
 u'am2x',
 u'amateur',
 u'amazing',
 u'ambiguous',
 u'america',
 u'american',
 u'americans',
 u'among',
 u'amount',
 u'amour',
 u'analysis',
 u'ancient',
 u'anderson',
 u'andersson',
 u'andrew',
 u'andreychuk',
 u'andy',
 u'angeles',
 u'angels',
 u'anger',
 u'angry',
 u'animals',
 u'ann',
 u'anna',
 u'announce',
 u'announced',
 u'announcer',
 u'announcers',
 u'annoying',
 u'another',
 u'answer',
 u'answered',
 u'answering',
 u'answers',
 u'antonio',
 u'anybody',
 u'anymore',
 u'anyone',
 u'anything',
 u'anytime',
 u'anyway',
 u'anyways',
 u'anywhere',
 u'apart',
 u'apologize',
 u'apostle',
 u'apostles',
 u'apparent',
 u'apparently',
 u'appear',
 u'appeared',
 u'appears',
 u'apple',
 u'applelink',
 u'appletalk',
 u'application',
 u'applications',
 u'applied',
 u'applies',
 u'apply',
 u'appointed',
 u'appreciate',
 u'appreciated',
 u'approach',
 u'appropriate',
 u'approved',
 u'apr',
 u'april',
 u'aps',
 u'aquinas',
 u'arbor',
 u'area',
 u'areas',
 u'aren',
 u'arena',
 u'argue',
 u'arguing',
 u'argument',
 u'arguments',
 u'army',
 u'around',
 u'arrogance',
 u'arrogant',
 u'art',
 u'articles',
 u'artificial',
 u'arts',
 u'ashley',
 u'aside',
 u'ask',
 u'asked',
 u'asking',
 u'asks',
 u'aspect',
 u'aspects',
 u'ass',
 u'assertion',
 u'assist',
 u'assistant',
 u'assists',
 u'associate',
 u'associated',
 u'association',
 u'assume',
 u'assumed',
 u'assuming',
 u'assumption',
 u'astray',
 u'atheism',
 u'atheist',
 u'atheists',
 u'athena',
 u'athens',
 u'athos',
 u'atlanta',
 u'att',
 u'attack',
 u'attempt',
 u'attempted',
 u'attempting',
 u'attempts',
 u'attendance',
 u'attention',
 u'atterlep',
 u'attitude',
 u'au',
 u'audience',
 u'austin',
 u'australia',
 u'author',
 u'authorities',
 u'authority',
 u'authors',
 u'auto',
 u'available',
 u'ave',
 u'avenue',
 u'average',
 u'avoid',
 u'awarded',
 u'awards',
 u'aware',
 u'away',
 u'awesome',
 u'awful',
 u'axelsson',
 u'back',
 u'background',
 u'backup',
 u'bad',
 u'baker',
 u'ball',
 u'ballentine',
 u'baltimore',
 u'band',
 u'baptism',
 u'barrasso',
 u'base',
 u'baseball',
 u'based',
 u'basic',
 u'basically',
 u'basis',
 u'basketball',
 u'battery',
 u'battle',
 u'bay',
 u'bbs',
 u'bc',
 u'bear',
 u'bears',
 u'beat',
 u'beautiful',
 u'became',
 u'because',
 u'become',
 u'becomes',
 u'becoming',
 u'been',
 u'before',
 u'beg',
 u'began',
 u'begin',
 u'beginning',
 u'begins',
 u'behalf',
 u'behavior',
 u'behaviour',
 u'behind',
 u'being',
 u'beings',
 u'belfour',
 u'belief',
 u'beliefs',
 u'believe',
 u'believed',
 u'believers',
 u'believes',
 u'believing',
 u'bell',
 u'bellows',
 u'belong',
 u'below',
 u'bench',
 u'benefit',
 u'beranek',
 u'berkeley',
 u'bernoulli',
 u'beside',
 u'besides',
 u'best',
 u'bet',
 u'better',
 u'between',
 u'beyond',
 u'bgsu',
 u'bgu',
 u'biased',
 u'bible',
 u'biblical',
 u'big',
 u'bigger',
 u'biggest',
 u'bill',
 u'birth',
 u'bishop',
 u'bit',
 u'bitnet',
 u'bits',
 u'black',
 u'blackhawks',
 u'blame',
 u'bless',
 u'blessed',
 u'blind',
 u'blindly',
 u'block',
 u'blood',
 u'blow',
 u'blue',
 u'blues',
 u'bnr',
 u'board',
 u'boarding',
 u'boards',
 u'bob',
 u'bobby',
 u'body',
 u'book',
 u'books',
 u'bookstore',
 u'boot',
 u'born',
 u'bos',
 u'boss',
 u'boston',
 u'both',
 u'bother',
 u'bothers',
 u'bottom',
 u'bought',
 u'boulder',
 u'bound',
 u'boundary',
 u'bourque',
 u'bowling',
 u'box',
 u'boxes',
 u'boy',
 u'brad',
 u'bradley',
 u'brain',
 u'branch',
 u'brand',
 u'braves',
 u'break',
 u'breaker',
 u'breaking',
 u'brent',
 u'breton',
 u'bri',
 u'brian',
 u'brief',
 u'brind',
 u'bring',
 u'brings',
 u'british',
 u'broadcast',
 u'broke',
 u'broken',
 u'brother',
 u'brothers',
 u'brought',
 u'brown',
 u'bruce',
 u'bruins',
 u'brunswick',
 u'bryan',
 u'btw',
 u'bu',
 u'buddhism',
 u'buddhist',
 u'buf',
 u'buffalo',
 u'bug',
 u'build',
 u'building',
 u'built',
 u'bunch',
 u'bure',
 u'burnaby',
 u'burned',
 u'burns',
 u'bus',
 u'business',
 u'butt',
 u'button',
 u'buy',
 u'buying',
 u'byler',
 u'bytes',
 u'c610',
 u'c650',
 u'ca',
 u'cable',
 u'cables',
 u'cache',
 u'cadkey',
 u'cage',
 u'cal',
 u'calder',
 u'calgary',
 u'california',
 u'call',
 u'called',
 u'calling',
 u'calls',
 u'calvin',
 u'cam',
 u'came',
 u'camp',
 u'campbell',
 u'campus',
 u'canada',
 u'canadian',
 u'canadians',
 u'canadiens',
 u'cannot',
 u'canon',
 u'canucks',
 u'capability',
 u'capable',
 u'capacity',
 u'cape',
 u'capital',
 u'capitals',
 u'caps',
 u'captain',
 u'captains',
 u'car',
 u'caralv',
 u'card',
 u'cardinal',
 u'cards',
 u'care',
 u'career',
 u'careful',
 u'carefully',
 u'cares',
 u'carleton',
 u'carnegie',
 u'carol',
 u'carolina',
 u'carpenter',
 u'carried',
 u'carry',
 u'carrying',
 u'carson',
 u'cartridge',
 u'case',
 u'cases',
 u'cassels',
 u'cast',
 u'category',
 u'catholic',
 u'catholics',
 u'caught',
 u'cause',
 u'caused',
 u'causes',
 u'cb',
 u'cbc',
 u'cbnewsh',
 u'cbnewsk',
 u'cc',
 u'ccu',
 u'cd',
 u'cd300',
 u'cec1',
 u'celebrate',
 u'cent',
 u'center',
 u'central',
 u'centre',
 u'centris',
 u'centuries',
 u'century',
 u'ceremonial',
 u'ceremony',
 u'certain',
 u'certainly',
 u'certainty',
 u'cf',
 u'cgsvax',
 u'chain',
 u'champions',
 u'championship',
 u'championships',
 u'champs',
 u'chance',
 u'chances',
 u'change',
 u'changed',
 u'changes',
 u'changing',
 u'channel',
 u'chapter',
 u'character',
 u'charge',
 u'charles',
 u'charlie',
 u'cheap',
 u'cheaper',
 u'check',
 u'checked',
 u'checking',
 u'cheers',
 u'chelios',
 u'chem',
 u'chemistry',
 u'cherry',
 u'chhabra',
 u'chi',
 u'chicago',
 u'chief',
 u'child',
 u'children',
 u'chip',
 u'chips',
 u'choice',
 u'choices',
 u'choose',
 u'choosing',
 u'chose',
 u'chosen',
 u'chris',
 u'christ',
 u'christian',
 u'christianity',
 u'christians',
 u'christopher',
 u'chuck',
 u'church',
 u'churches',
 u'ciccarelli',
 u'circle',
 u'circumstances',
 u'cis',
 u'cited',
 u'cities',
 u'city',
 u'civil',
 u'claim',
 u'claimed',
 u'claiming',
 u'claims',
 u'claremont',
 u'clark',
 u'clarke',
 u'clarkson',
 u'class',
 u'classic',
 u'classy',
 u'claude',
 u'clean',
 u'clear',
 u'clearly',
 u'clement',
 u'cleveland',
 u'clh',
 u'clinton',
 u'clock',
 u'close',
 u'closed',
 u'closely',
 u'closer',
 u'closest',
 u'club',
 u'cmu',
 u'co',
 u'coach',
 u'coaches',
 u'coaching',
 u'code',
 u'coffey',
 u'cohen',
 u'col',
 u'collection',
 u'college',
 u'collingridge',
 u'color',
 u'colorado',
 u'colors',
 u'colostate',
 u'colour',
 u'columbia',
 u'com',
 u'combination',
 u'combined',
 u'come',
 u'comes',
 u'comics',
 u'coming',
 u'command',
 u'commandments',
 u'commands',
 u'comment',
 u'commentary',
 u'comments',
 u'commitment',
 u'committed',
 u'common',
 u'communication',
 u'communications',
 u'communion',
 u'community',
 u'comp',
 u'companies',
 u'company',
 u'comparable',
 u'comparative',
 u'compare',
 u'compared',
 u'comparing',
 u'comparison',
 u'compatible',
 u'complain',
 u'complete',
 u'completed',
 u'completely',
 u'complex',
 u'complicated',
 u'compuserve',
 u'computer',
 u'computers',
 u'computing',
 u'con',
 u'concept',
 u'conception',
 u'concern',
 u'concerned',
 u'concerning',
 u'conclude',
 u'conclusion',
 u'conclusions',
 u'condemned',
 u'conditions',
 u'conditt',
 u'conference',
 u'configuration',
 u'conflict',
 u'confused',
 u'confusing',
 u'confusion',
 u'congregation',
 u'connect',
 u'connected',
 u'connection',
 u'connector',
 u'consecutive',
 u'consensus',
 u'conservative',
 u'consider',
 u'considered',
 u'considering',
 u'consistent',
 u'consistently',
 u'constant',
 u'constantly',
 u'consultant',
 u'contact',
 u'contained',
 u'contains',
 u'content',
 u'context',
 u'continent',
 u'continue',
 u'continued',
 u'continues',
 u'contract',
 u'contradict',
 u'contradiction',
 u'contradictory',
 u'contrary',
 u'contribution',
 u'control',
 u'conversion',
 u'convert',
 u'converted',
 u'convictions',
 u'convince',
 u'cook',
 u'cool',
 u'coos',
 u'copies',
 u'coprocessor',
 u'copy',
 u'cor',
 u'cordially',
 u'corinthians',
 u'corner',
 u'corp',
 u'corporation',
 u'correct',
 u'correctly',
 u'cost',
 u'costs',
 u'cote',
 u'could',
 u'couldn',
 u'council',
 u'count',
 u'counted',
 u'countries',
 u'country',
 u'couple',
 u'course',
 u'court',
 u'courtnall',
 u'cover',
 u'coverage',
 u'covered',
 u'covington',
 u'cpu',
 u'craft',
 u'craig',
 u'crap',
 u'crashes',
 u'craven',
 u'crazy',
 u'create',
 u'created',
 u'creation',
 u'creator',
 u'credit',
 u'critical',
 u'criticism',
 u'cross',
 u'crowd',
 u'cs',
 u'csd',
 u'cso',
 u'cu',
 u'cullen',
 u'cult',
 u'cultural',
 u'culture',
 u'cunixb',
 u'cunixc',
 u'cup',
 u'cups',
 u'curious',
 u'current',
 u'currently',
 u'curtis',
 u'cut',
 u'cwis',
 u'cwru',
 u'czech',
 u'd88',
 u'dahlen',
 u'daily',
 u'dal',
 u'dale',
 u'dalhousie',
 u'dallas',
 u'damn',
 u'damphousse',
 u'dan',
 u'danger',
 u'dangerous',
 u'daniel',
 u'dare',
 u'darius',
 u'darius_lecointe',
 u'darren',
 u'dartmouth',
 u'daryl',
 ...]

In [169]:
vectorizer.get_feature_names()[877]


Out[169]:
u'connect'
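
Each `(0, j)` entry printed for `matrix[0]` is the TF-IDF weight of feature `j`, and the feature names turn `j` back into a word. As a rough sketch, the hypothetical helper below pairs indices with names and ranks them by weight. The four-term vocabulary and the weights are invented for illustration.

```python
def top_terms(row_entries, feature_names, k=3):
    """Return the k highest-weighted (term, weight) pairs of one sparse row.

    row_entries: iterable of (column index, weight) pairs.
    """
    ranked = sorted(row_entries, key=lambda pair: pair[1], reverse=True)
    return [(feature_names[j], w) for j, w in ranked[:k]]

# Hypothetical vocabulary and row; the real ones come from
# vectorizer.get_feature_names() and matrix[0].
feature_names = ['apple', 'connect', 'monitor', 'screen']
row = [(1, 0.11), (3, 0.39), (0, 0.05), (2, 0.23)]
print(top_terms(row, feature_names, k=2))
```

For a SciPy CSR row such as `matrix[0]`, the pairs could be obtained with `zip(matrix[0].indices, matrix[0].data)`.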

In [170]:
simple_dataset.data[0]


Out[170]:
u'From: erik@cheshire.oxy.edu (Erik Adams)\nSubject: HELP!!  My Macintosh "luggable" has lines on its screen!\nOrganization: Occidental College, Los Angeles, CA 90041 USA.\nDistribution: comp\nLines: 20\n\nOkay, I don\'t use it very much, but I would like for it to keep working\ncorrectly, at least as long as Apple continues to make System software\nthat will run on it, if slowly :-)\n\nHere is the problem:  When the screen is tilted too far back, vertical\nlines appear on the screen.  They are every 10 pixels or so, and seem\nto be affected somewhat by opening windows and pulling down menus.\nIt looks to a semi-technical person like there is a loose connection\nbetween the screen and the rest of the computer.\n\nI am open to suggestions that do not involve buying a new computer,\nor taking this one to the shop.  I would also like to not have\nto buy one of Larry Pina\'s books.  I like Larry, but I\'m not sure\nI feel strongly enough about the computer to buy a service manual\nfor it.\n\nOn a related note:  what does the monitor connector connect to?\n\nErik\n\n'

KMeans


In [172]:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, random_state=1)
preds = model.fit_predict(matrix.toarray())
print(preds)


[0 0 2 ..., 0 2 1]

In [173]:
print(simple_dataset.target)


[0 0 1 ..., 0 1 2]

In [174]:
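# Cluster ids are arbitrary, so map them onto the class labels
# (the mapping is chosen by comparing the two outputs above).
mapping = {2: 1, 1: 2, 0: 0}
mapped_preds = [mapping[pred] for pred in preds]
# Error rate: fraction of posts whose mapped cluster differs from the true label.
print(float(sum(mapped_preds != simple_dataset.target)) / len(simple_dataset.target))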


0.0483961733258
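
The mapping above was picked by eye. One way to derive it instead is to count how often each cluster id co-occurs with each true label and take the majority label per cluster. The sketch below does this on hypothetical toy labels; `contingency` and `majority_mapping` are illustrative helpers. A majority vote is not guaranteed to produce a one-to-one mapping when clusters are mixed.

```python
from collections import Counter

def contingency(preds, target):
    """Count how often each (cluster id, true label) pair occurs."""
    return Counter(zip(preds, target))

def majority_mapping(preds, target):
    """Map every cluster id to the true label it co-occurs with most often."""
    best = {}
    for (cluster, label), count in contingency(preds, target).items():
        if cluster not in best or count > best[cluster][1]:
            best[cluster] = (label, count)
    return {cluster: label for cluster, (label, _) in best.items()}

# Hypothetical toy labels, not the notebook's data.
toy_preds = [0, 0, 2, 2, 1, 1, 0, 2]
toy_target = [0, 0, 1, 1, 2, 2, 0, 2]
print(majority_mapping(toy_preds, toy_target))
```

On this toy input the vote recovers the same mapping as the hand-written one: cluster 0 to label 0, cluster 1 to label 2, cluster 2 to label 1.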

In [175]:
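from sklearn.linear_model import LogisticRegression
# Current releases provide cross_val_score in sklearn.model_selection;
# the older sklearn.cross_validation module has been removed.
from sklearn.model_selection import cross_val_score

# Supervised baseline on the same features, for comparison with clustering.
clf = LogisticRegression()
print(cross_val_score(clf, matrix, simple_dataset.target).mean())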


0.985360318588

A harder dataset


In [176]:
dataset = fetch_20newsgroups(
    subset='train', 
    categories=['comp.sys.mac.hardware', 'comp.os.ms-windows.misc', 'comp.graphics'])

In [177]:
# Refit the same vectorizer on the new, more closely related categories.
matrix = vectorizer.fit_transform(dataset.data)
model = KMeans(n_clusters=3, random_state=42)
preds = model.fit_predict(matrix.toarray())
print(preds)
print(dataset.target)


[0 1 2 ..., 0 2 0]
[2 1 1 ..., 2 0 2]

In [178]:
# Map cluster ids onto class labels, then compute the error rate.
mapping = {2: 0, 1: 1, 0: 2}
mapped_preds = [mapping[pred] for pred in preds]
print(float(sum(mapped_preds != dataset.target)) / len(dataset.target))


0.261266400456

In [179]:
clf = LogisticRegression()
print(cross_val_score(clf, matrix, dataset.target).mean())


0.917279226713

SVD + KMeans


In [180]:
from sklearn.decomposition import TruncatedSVD

model = KMeans(n_clusters=3, random_state=42)
svd = TruncatedSVD(n_components=1000, random_state=123)
features = svd.fit_transform(matrix)
preds = model.fit_predict(features)
print(preds)
print(dataset.target)


[0 2 1 ..., 0 1 0]
[2 1 1 ..., 2 0 2]

In [181]:
# Map cluster ids onto class labels, then compute the error rate.
mapping = {0: 2, 1: 0, 2: 1}
mapped_preds = [mapping[pred] for pred in preds]
print(float(sum(mapped_preds != dataset.target)) / len(dataset.target))


0.206503137479

In [182]:
model = KMeans(n_clusters=3, random_state=42)
svd = TruncatedSVD(n_components=200, random_state=123)
features = svd.fit_transform(matrix)
preds = model.fit_predict(features)
print(preds)
print(dataset.target)


[2 0 1 ..., 2 1 2]
[2 1 1 ..., 2 0 2]

In [183]:
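import itertools

def validate_with_mappings(preds, target, dataset):
    # Print the error rate for every assignment of cluster ids to class labels;
    # the smallest value corresponds to the best match.
    # (`dataset` is not used inside the function.)
    permutations = itertools.permutations([0, 1, 2])
    for a, b, c in permutations:
        mapping = {2: a, 1: b, 0: c}
        mapped_preds = [mapping[pred] for pred in preds]
        print(float(sum(mapped_preds != target)) / len(target))

validate_with_mappings(preds, dataset.target, dataset)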


0.900741585853
0.674272675414
0.705647461495
0.893896177981
0.205362236167
0.620079863092
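
Of the six error rates printed above, only the smallest one (about 0.205) matters, because it belongs to the best assignment of cluster ids to labels. The following hypothetical variant, `best_mapping_error`, returns that minimum together with the mapping instead of printing every value. It is a sketch on toy labels, not the notebook's data.

```python
import itertools

def best_mapping_error(preds, target, n_clusters=3):
    """Return (lowest error rate, mapping) over all cluster-to-label permutations."""
    best_error, best_mapping = None, None
    for perm in itertools.permutations(range(n_clusters)):
        mapping = dict(enumerate(perm))  # cluster id -> class label
        errors = sum(mapping[p] != t for p, t in zip(preds, target))
        error = errors / float(len(target))
        if best_error is None or error < best_error:
            best_error, best_mapping = error, mapping
    return best_error, best_mapping

# Hypothetical toy labels.
toy_preds = [0, 0, 2, 2, 1, 1, 0, 2]
toy_target = [0, 0, 1, 1, 2, 2, 0, 2]
print(best_mapping_error(toy_preds, toy_target))
```

The exhaustive search is fine for three clusters, but the number of permutations grows factorially with the cluster count.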

In [184]:
model = KMeans(n_clusters=3, random_state=42)
svd = TruncatedSVD(n_components=200, random_state=321)
features = svd.fit_transform(matrix)
preds = model.fit_predict(features)
print(preds)
print(dataset.target)
validate_with_mappings(preds, dataset.target, dataset)


[2 1 0 ..., 2 0 2]
[2 1 1 ..., 2 0 2]
0.713063320023
0.845407872219
0.889332572732
0.70051340559
0.586423274387
0.265259555048
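
Changing only the SVD `random_state` moved the best error from about 0.205 to about 0.265, and the best permutation changed as well. A score that does not depend on how clusters are numbered avoids the mapping step altogether. The sketch below computes the plain (unadjusted) Rand index in pure Python. It is not used in the original notebook; scikit-learn offers a chance-corrected version as `sklearn.metrics.adjusted_rand_score`.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree
    (both put the pair together, or both put it apart)."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / float(total)

# Hypothetical toy labels.
toy_target = [0, 0, 1, 1, 2, 2]
relabeled = [2, 2, 0, 0, 1, 1]   # same partition, different cluster ids
print(rand_index(relabeled, toy_target))
```

The loop visits every pair of points, so this version is only practical for small inputs.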

Summary

  1. We obtained interpretable results on both datasets.
  2. Real-world data, however, is far less forgiving.
  3. We tried both AgglomerativeClustering and KMeans.